optimal hyperparameter
0be50b4590f1c5fdf4c8feddd63c4f67-Supplemental-Datasets_and_Benchmarks.pdf
In Figure 1 we demonstrate the common neighbor (CN) distribution among positive and negative test samples for ogbl-collab, ogbl-ppa, and ogbl-citation2. These results demonstrate that a vast majority of negative samples have no CNs. Since CNs is a typically good heuristic, this makes it easy to identify most negative samples. We further present the CN distribution of Cora, Citeseer, Pubmed, and ogbl-ddi in Figure 3. The CN distribution of Cora, Citeseer, and Pubmed are consistent with our previous observations on the OGB datasets in Figure 1.
Optimizing Optimizers for Fast Gradient-Based Learning
We lay the theoretical foundation for automating optimizer design in gradient-based learning. Based on the greedy principle, we formulate the problem of designing optimizers as maximizing the instantaneous decrease in loss. By treating an optimizer as a function that translates loss gradient signals into parameter motions, the problem reduces to a family of convex optimization problems over the space of optimizers. Solving these problems under various constraints not only recovers a wide range of popular optimizers as closed-form solutions, but also produces the optimal hyperparameters of these optimizers with respect to the problems at hand. This enables a systematic approach to design optimizers and tune their hyperparameters according to the gradient statistics that are collected during the training process. Furthermore, this optimization of optimization can be performed dynamically during training. Just as optimizers train their models by feeding them parameter velocities θ, models can also fit the optimizers to the underlying tasks by feeding gradients g. We are interested in the problem of designing optimiz-ers that maximize the utility of gradient-based learning for a given task. The process of learning manifests as the parameter motion θ driven by the gradient force g applied at each step t. Physics requires a constitutive law that relates kinematic motion to its motive force. In gradient-based learning, optimizers take that role. We can represent an optimizer as a positive semidefinite operator Q 0 that linearly translates the gradients into the parameter updates, θ = Q g. (1) Later sections will reveal that many existing optimizers fall into this category. Q g. (2) Adhering to the greedy paradigm, we turn our original problem of maximizing the utility of learning into a different optimization problem that maximizes this loss drop with respect to the optimizer Q: maximize Problem P1 reveals two design options that bound this maximum: (1) the trust region implied by the feasible set Q Q, and (2) the gradient distribution under the expectation E. Our main focus is on how these two factors determine the optimal optimizer Q Optimizers and their hyperparameters can be dynamically tuned or even be replaced by better ones according to the intermediate probes from the gradients in the middle of training. By reverse engineering commonly used optimizers, we draw the landscape of optimizers that have driven the success of machine learning (Robbins & Monro, 1951; Kingma & Ba, 2015; Loshchilov & Hutter, 2019; Gupta et al., 2018; Martens & Grosse, 2015) into a single picture. This lets us better use the well-studied optimizers in practice and also suggest extensions to them. Note that Σ is a symmetric and positive semidefinite (PSD) matrix of shape d d.
BO-SA-PINNs: Self-adaptive physics-informed neural networks based on Bayesian optimization for automatically designing PDE solvers
Zhang, Rui, Li, Liang, Lanteri, Stéphane, Kang, Hao, Li, Jiaqi
Physics-informed neural networks (PINNs) is becoming a popular alternative method for solving partial differential equations (PDEs). However, they require dedicated manual modifications to the hyperparameters of the network, the sampling methods and loss function weights for different PDEs, which reduces the efficiency of the solvers. In this paper, we pro- pose a general multi-stage framework, i.e. BO-SA-PINNs to alleviate this issue. In the first stage, Bayesian optimization (BO) is used to select hyperparameters for the training process, and based on the results of the pre-training, the network architecture, learning rate, sampling points distribution and loss function weights suitable for the PDEs are automatically determined. The proposed hyperparameters search space based on experimental results can enhance the efficiency of BO in identifying optimal hyperparameters. After selecting the appropriate hyperparameters, we incorporate a global self-adaptive (SA) mechanism the second stage. Using the pre-trained model and loss information in the second-stage training, the exponential moving average (EMA) method is employed to optimize the loss function weights, and residual-based adaptive refinement with distribution (RAR-D) is used to optimize the sampling points distribution. In the third stage, L-BFGS is used for stable training. In addition, we introduce a new activation function that enables BO-SA-PINNs to achieve higher accuracy. In numerical experiments, we conduct comparative and ablation experiments to verify the performance of the model on Helmholtz, Maxwell, Burgers and high-dimensional Poisson equations. The comparative experiment results show that our model can achieve higher accuracy and fewer iterations in test cases, and the ablation experiments demonstrate the positive impact of every improvement.
Towards Precise Scaling Laws for Video Diffusion Transformers
Yin, Yuanyang, Zhao, Yaqi, Zheng, Mingwu, Lin, Ke, Ou, Jiarong, Chen, Rui, Huang, Victor Shea-Jay, Wang, Jiahao, Tao, Xin, Wan, Pengfei, Zhang, Di, Yin, Baoqun, Zhang, Wentao, Gai, Kun
Achieving optimal performance of video diffusion transformers within given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealed under practical inference cost constraints, achieving a better trade-off.